Word Segmentation Needs Change - From a Linguist's View
Abstract
The authors propose a change to current Chinese word segmentation technology: so-called segmentation should be divided into separate, distinct phases. First, segmentation proper should be limited to segmenting Chinese characters rather than so-called Chinese words. In character segmentation, all the information about each character is extracted. A phase called Chinese morphological processing (CMP) then begins. The first step of CMP combines the separate characters; it is followed by post-segmentation processing, which covers all sorts of repetitive structures, Chinese-style abbreviations, the recognition and processing of pseudo-OOVs, and so on. Most post-segmentation processing may have to be done by rule-based subroutines, so the current corpus-based methodology needs to change by merging with rule-based techniques.
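The phases described in the abstract can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' system: the toy `LEXICON`, the greedy longest-match combiner, the single AABB reduplication rule, and all function names are hypothetical choices made here, and the abstract does not say how each phase is actually implemented.

```python
# Hypothetical toy lexicon, chosen only for illustration.
LEXICON = {"我们", "高兴"}


def segment_characters(text):
    """Phase 1: split text into single characters and attach per-character info.

    The only information recorded here is whether the character falls in the
    basic CJK Unified Ideographs block (an assumption of this sketch).
    """
    return [{"char": ch, "is_han": "\u4e00" <= ch <= "\u9fff"} for ch in text]


def combine(chars, lexicon, max_len=4):
    """CMP step 1: combine characters into units by greedy longest match."""
    units, i = [], 0
    while i < len(chars):
        for size in range(min(max_len, len(chars) - i), 0, -1):
            candidate = "".join(c["char"] for c in chars[i:i + size])
            # A single character is always accepted as a fallback unit.
            if size == 1 or candidate in lexicon:
                units.append(candidate)
                i += size
                break
    return units


def merge_aabb(units, lexicon):
    """Post-segmentation rule for one repetitive structure (AABB).

    With the greedy combiner above, a reduplicated AABB form of a lexicon
    word AB surfaces as the three units A, AB, B; this rule merges them.
    """
    out, i = [], 0
    while i < len(units):
        window = units[i:i + 3]
        if (len(window) == 3
                and len(window[1]) == 2
                and window[1] in lexicon
                and window[0] == window[1][0]
                and window[2] == window[1][1]):
            out.append("".join(window))
            i += 3
        else:
            out.append(units[i])
            i += 1
    return out


def process(text):
    """Run the phases in order: characters, combination, rule-based repair."""
    chars = segment_characters(text)
    return merge_aabb(combine(chars, LEXICON), LEXICON)


print(process("我们高高兴兴"))
```

In this sketch the combination step alone yields 我们 / 高 / 高兴 / 兴, and the rule-based pass then restores 高高兴兴 as one unit, which shows why a rule-based step after combination can be useful. A real rule set would need safeguards against false merges, which are omitted here.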
Similar resources
Word Segmentation Based on Estimation of Words from Examples
From a cognitive point of view, words can be recognized based on learned data obtained from linguistic materials. Namely, people learn words from the many examples they encounter. We propose a word segmentation algorithm based on estimated knowledge of words acquired from both the local text being processed and a POS-tagged corpus. In order to show the feasibility of our model, we apply i...
Word segmentation in Persian continuous speech using F0 contour
Word segmentation in continuous speech is a complex cognitive process. Previous research on spoken word segmentation has revealed that in fixed-stress languages, listeners use acoustic cues to stress to de-segment speech into words. It has been further assumed that stress in non-final or non-initial position hinders the demarcative function of this prosodic factor. In Persian, stress is retract...
Keyboard Logs as Natural Annotations for Word Segmentation
In this paper we propose a framework to improve word segmentation accuracy using input method logs. An input method is software used to type sentences in languages which have far more characters than the number of keys on a keyboard. The main contributions of this paper are: 1) an input method server that proposes word candidates which are not included in the vocabulary, 2) a publicly usable in...
Word Segmentation of Vietnamese Texts: a Comparison of Approaches
We present in this paper a comparison between three segmentation systems for the Vietnamese language. Indeed, the majority of Vietnamese words is built by semantic composition from about 7,000 syllables, that also have a meaning as isolated words. So the identification of word boundaries in a text is not a simple task, and ambiguities often appear. Beyond the presentation of the tested systems,...
Publication date: 2010